Data Dictionary

In this study, Personal_Loans serves as our variable of interest. We will examine the key variables we uncover and explore how they cluster around the people who accepted the offer. Special interest will be given to customers who are already depositors, as indicated by Securities_Account and CD_Account.

No missing or null values, YAY!

Given that numerical variables with >50 unique values can generally be considered continuous, we should examine the remaining variables to determine whether they should be grouped or kept continuous (e.g. age). ZIP Code values could also potentially be turned into categoricals.

Given the number of individual ZIP Codes, we can potentially group them into a smaller set of categories.

From Wikipedia: ZIP Codes are numbered so that the first digit designates a broad geographic region, and the second and third digits identify a sectional center facility within that region.

So, by using the second and third digits, we could potentially group them effectively.
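As a sketch (the `ZIP Code` column name and the sample values are hypothetical), such a grouping could slice off the leading digits, which carry the regional and sectional-center information:

```python
import pandas as pd

# Hypothetical sample of 5-digit ZIP Codes (column name assumed)
df = pd.DataFrame({"ZIP Code": [94720, 94112, 90245, 91330, 92121]})

# The first digit is a broad region and digits 2-3 narrow it to a
# sectional center facility, so the first three digits together make
# a reasonable coarser grouping key.
df["ZIP_Region"] = df["ZIP Code"].astype(str).str[:3].astype("category")

print(df["ZIP_Region"].nunique())  # far fewer groups than raw ZIP Codes
```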

There is also a wide spread in the ranges of the variables, so the data will need to be scaled for our regression models. The scaled data will be stored in its own dataframe.
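A minimal sketch of this scaling step, using `StandardScaler` on hypothetical columns and keeping the result in a separate dataframe so the original stays intact:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric columns with very different ranges
df = pd.DataFrame({"Income": [49, 180, 72, 130], "CCAvg": [1.6, 8.9, 0.4, 2.5]})

# Fit-transform and store the scaled copy in its own dataframe
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Each column now has mean ~0 and unit variance
print(df_scaled.round(3))
```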

Univariate Analysis

From these histograms alone, the distributions of this bank's customers are not clear. It is critical to assess the makeup of the customer base, then examine the distribution of customers and determine whether there is any correlation between customer type and acceptance of the previous offer.

Using a Venn diagram, we can more clearly see the overlap between customers. Most of the bank's clients are credit card users (59.66%). This is followed by people with mortgages. However, very few people interact with the bank exclusively through mortgages; rather, the second-largest clientele group (26%) is customers with mortgages who also have credit cards.

Finally, similar to the histograms above, we see that very few of the bank's customers hold deposit accounts.

We shall now separate these distributions between people who didn't take the offer and, more importantly, people who did.

Very clearly, we see that the customers who took the offer are EXCLUSIVELY credit card users. This will likely have a big effect as we move forward.

We can observe that the personal loan acceptance percentages increase for clients who had more than one type of account with the bank, compared to clients who had only one account. For example, the percentages of people holding only a credit card account, a mortgage, a savings account, or a checking account went from 61.14%, 0.56%, 0.16%, and 0.2% respectively for clients who did not take the offer to 45.93%, 0%, 0%, and 0% respectively for clients who did accept it. Conversely, there is a pattern that the ratios for clients with 2 or more accounts increased among clients who accepted the offer (e.g. the ratio of clients with both a checking account and a mortgage went from 0.81% for clients who didn't accept the offer to 10.65% for those who did).

This implies that the distribution of account category of the client might be more informative than the actual level of usage.
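A sketch of how these per-combination percentages could be computed, assuming hypothetical binary indicator columns and a `Personal_Loan` target (the column names and toy values are assumptions):

```python
import pandas as pd

# Hypothetical binary account indicators and loan outcome
df = pd.DataFrame({
    "CreditCard":    [1, 1, 0, 1, 0, 1],
    "CD_Account":    [0, 1, 0, 1, 0, 0],
    "Personal_Loan": [0, 1, 0, 1, 0, 0],
})

# Label each client by the combination of accounts they hold
combo = df[["CreditCard", "CD_Account"]].astype(str).agg("+".join, axis=1)

# Share of each combination within non-acceptors (0) vs acceptors (1)
pct = pd.crosstab(combo, df["Personal_Loan"], normalize="columns") * 100
print(pct.round(2))
```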

Before adding this variable, we will construct our model with the present data.

We can also compare the distributions of customers between the general population and distributions for customers that accepted the offer.

Below we perform multivariate analysis to further visually assess relations between the variables.

Comparing the data between general customers and customers who accepted the offer, we see clear differences in the distributions. While age and experience seem fairly irrelevant, the income distribution for people who accepted the offer is substantially higher (IQR = [125 - 175]) than for the general customer (IQR = [50 - 100]). Furthermore, there is a slight shift towards larger families (3-4 members) among customers who accepted. As expected, customers who accepted the offer also incurred higher credit card spending (~4k/mo) versus the general customer (~1.5k/mo).

Looking at these distributions, it would seem that the variables that will be least impactful/informative to the model will be:

All other variables seem to be valuable predictive variables.

Furthermore, we can observe that there is no concern regarding outliers that would affect our model.

We can notice two points for concern:

Because of the categorical nature of this variable, one-hot encoding would normally be necessary for our logistic regression. However, this variable carries very little information, so it is likely to be removed during model performance evaluation.
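A minimal sketch of the one-hot encoding step, assuming a hypothetical categorical `Education` column:

```python
import pandas as pd

# Hypothetical categorical column (e.g. an education level code)
df = pd.DataFrame({"Education": [1, 2, 3, 1], "Income": [49, 180, 72, 130]})

# One-hot encode for logistic regression; drop_first avoids the
# dummy-variable trap (perfect collinearity among the dummies)
df_enc = pd.get_dummies(df, columns=["Education"], drop_first=True)
print(df_enc.columns.tolist())
```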

At this point, the data is ready for a decision tree model. However, to start with our logistic regression model, we will copy our DOI dataframe and optimize it for logistic regression.

Logistic Regression

At this point, our dataframe is ready for logistic modeling:

Building the model

We are developing a model that uses the independent variables to identify predictors of whether a client will accept a personal loan offer.
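A sketch of the model-building step on synthetic stand-in data (the real features and target would come from our prepared dataframe; all names and shapes here are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: X plays the role of the engineered features,
# y the Personal_Loan indicator
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Stratified split so both classes appear in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(round(model.score(X_test, y_test), 3))
```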

Concerns:

Therefore, we should attempt to increase the F1 score.
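For reference, the F1 score is the harmonic mean of precision and recall; a toy illustration with made-up labels and predictions:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Illustrative labels and predictions (values are made up)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)         # harmonic mean of the two
print(round(p, 3), round(r, 3), round(f1, 3))
```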

Model Performance Improvement through further data engineering

Removing those variables marginally degrades the performance metrics.

Summary of first round of model performance improvement

So it seems that the first dataframe gave the best fit performance.

We will use the AUC-ROC curve to try to increase F1 performance.

ROC-AUC on training set

ROC-AUC on testing set

Model Performance Improvement

Optimal threshold using AUC-ROC curve

The optimal threshold is the value that best separates the true positive rate from the false positive rate.
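A sketch of finding this threshold with Youden's J statistic (TPR - FPR), using illustrative scores in place of our model's predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Illustrative labels and scores (stand-ins for model probabilities)
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.6])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Youden's J: the threshold where TPR - FPR is largest
best = thresholds[np.argmax(tpr - fpr)]
print(best)
```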

This threshold results in a large increase in recall, with a notable decrease in precision (and an overall decrease in F1). This might still be highly advantageous: depending on the cost of including unlikely customers in the campaign, it may be better to capture as many of the true positives as possible at the expense of targeting more customers who are unlikely to take the offer.

We can still try to obtain a better threshold:

Using the Precision-Recall curve to see if we can find a better threshold
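A sketch of scanning the Precision-Recall curve for the threshold that maximizes F1, on the same kind of illustrative scores as above:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Illustrative labels and scores (stand-ins for model probabilities)
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.6])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# F1 at each candidate threshold; precision/recall carry one extra
# trailing point with no matching threshold, so drop it
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = thresholds[np.argmax(f1)]
print(best)
```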

Conclusion

We can see that using the optimal threshold from the AUC-ROC curve to build our logistic regression model results in the maximum recall value, while all our approaches show only a marginal effect on the F1 parameter. Creating our predictive model for customers who would take the personal loan offer with this threshold would yield the highest number of likely candidates, at the expense of targeting customers who are unlikely to take the offer.

Decision Tree

We will first construct a decision tree with the best-performing dataframe from our logistic regression model. The other dataframes will be tested if performance suggests it would be beneficial.

We can see that for both the training and, more importantly, the testing set, the model has very high accuracy. Furthermore, this classification approach yields a high recall score, which means we are accurately identifying customers who would accept the personal loan offer.
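A sketch of fitting such a tree and checking recall, again on synthetic stand-in data (the real inputs would be our prepared dataframe; shapes and names are assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features and loan target
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Unconstrained tree: very high train accuracy, so we check recall
# on the held-out test set to gauge generalization
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
print(round(recall_score(y_test, tree.predict(X_test)), 3))
```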

Observing most important features

This model is very complex, so using GridSearch will help us find the best parameters for our decision tree without risking underfitting.

However, before that, we can check whether the customer-type indicators are advantageous parameters for constructing our decision tree.

Despite the observations of the customer distributions seen in the Venn diagrams, these parameters show no signs of contributing to the model's predictive performance. So while interesting, it is fair to conclude that the primary prepared dataframe was effectively engineered as is.

We will return the variables to the originals and continue.

Using GridSearch for Hyperparameter tuning
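A sketch of this tuning step with `GridSearchCV`, on synthetic stand-in data (the grid values and scoring choice are assumptions in line with our recall-focused goal):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in features and target
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Candidate constraints to prune the tree without underfitting
param_grid = {
    "max_depth": [3, 4, 5],
    "min_samples_leaf": [5, 10, 20],
}

# Optimize for recall, since catching likely acceptors is the priority
grid = GridSearchCV(DecisionTreeClassifier(random_state=1),
                    param_grid, scoring="recall", cv=5)
grid.fit(X, y)
print(grid.best_params_)
```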

Conclusion

We were able to develop a decision tree model with only 4 layers of depth and a recall score of 0.85. We see that our model can very accurately (0.98+) determine which kind of customer is very likely to accept the personal loan. We also observe that our decision tree model performed much better than the logistic regression model. Primarily, we can conclude that income is the biggest contributing factor for targeting potential personal loan applicants, with education, family size, credit card usage, and having a checking account with the bank being the other contributing factors. Using our decision tree model, we can generally summarize the profile of the customer who would take the offer:

From our models, we would recommend that information be further obtained from new or untargeted customers and run through both models. Using the logistic regression model, more customers matching those who didn't take the offer in the past would be targeted, given that it is prone to false positives. If stretching the reach of the campaign would not be too expensive, this could be a desirable approach.

For a more targeted and better performing prediction, the decision tree should be used to target new customers. If the performance of our model is in fact sustainable, only ~1.5% of the untargeted customers would have accepted the offer.

The potential to raise the personal loan acceptance rate to +10% should be motivation for further revising the model if a secondary development phase were to arise.